Segmenting Documents using Multiple Lexical Features

Authors

  • Amanda C. Jobbins
  • Lindsay J. Evett
Abstract

A method is presented for segmenting documents into conceptually related areas. Determining the equivalence of text is often based on the number of word repetitions. This approach is unsuitable for detecting short segments because terms tend not to be repeated across just a few sentences. In this paper we investigate the contribution of two other lexical features to find related words: collocation and relation weights (which identify semantic relations). An experiment was conducted on a set of test data with known topic changes; performances of the three features were independently compared. A combination of all features was the most reliable indicator of a topic change. In another experiment, CNN news summaries were segmented into their individual news stories. Precision and recall rates of around 90% are reported for news story boundary detection.
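The word-repetition baseline mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it covers only the repetition feature, and the stopword list, window size, threshold, and function names (`gap_scores`, `boundaries`) are assumptions chosen for illustration. Collocation and relation weights would need external lexical resources and are not modelled here.

```python
import re
from collections import Counter
from math import sqrt

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "was", "for"}


def tokens(sentence):
    """Return lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(sentences, window=2):
    """Score each gap between adjacent sentences by word repetition.

    Element i of the result compares the `window` sentences before
    sentence i + 1 with the `window` sentences starting at sentence i + 1.
    """
    bags = [Counter(tokens(s)) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def boundaries(sentences, window=2, threshold=0.1):
    """Return each index i where a topic boundary is proposed before sentence i."""
    scores = gap_scores(sentences, window)
    return [i + 1 for i, score in enumerate(scores) if score < threshold]


if __name__ == "__main__":
    example = [
        "The bank raised interest rates.",
        "Interest rates affect the bank loans.",
        "The team won the football match.",
        "The football team celebrated the match.",
    ]
    print(boundaries(example))
```

With so few sentences per window, a gap scores high only when words happen to repeat, which is the limitation the abstract cites as the motivation for adding collocation and relation weights.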

Similar resources

A Supervised Approach to Keyword Extraction from Persian Documents using Lexical Chains

Keywords are the main focal points of interest within a text and are intended to represent the principal concepts outlined in the document. Determining keywords using traditional methods is a time-consuming process that requires specialized knowledge of the subject. For the purposes of indexing the vast expanse of electronic documents, it is important to automate the keyword extraction task. S...

Segmenting Broadcast News Streams using Lexical Chains

In this paper we propose a coarse-grained NLP approach to text segmentation based on the analysis of lexical cohesion within text. Most work in this area has focused on the discovery of textual units that discuss subtopic structure within documents. In contrast, our segmentation task requires the discovery of topical units of text, i.e., distinct news stories from broadcast news programmes. Our sy...

Using Machine Learning Techniques for Subjectivity Analysis based on Lexical and Non-lexical Features

Machine learning techniques have been used to address various problems, and classification of documents is one of the main applications of such techniques. Opinion mining has emerged as an active research domain due to its wide range of applications, such as multi-document summarization, opinion mining of documents, analysis of users’ reviews, and improving answers to opinion questions in forums. Exis...

Mongolian Named Entity Recognition System with Rich Features

In this paper, we first build a manually annotated named entity corpus of Mongolian. Then, we propose three morphological processing methods and study comprehensive features, including syllable features, lexical features, context features, morphological features and semantic features in Mongolian named entity recognition. Moreover, we also evaluate the influence of word cluster features on the ...

Lexical Chains as Document Features

Document clustering and classification are usually done by representing the documents using a bag-of-words scheme. This scheme ignores many of the linguistic and semantic features contained in text documents. We propose here an alternative representation for documents using Lexical Chains. We compare the performance of the new representation against the old one on a clustering task. We show that...

Publication date: 1999